Statistical inference and data mining: false discoveries control
Abstract
Data mining is characterised by its ability to process large amounts of data. Among those data are the "features": variables, or association rules that can be derived from them. Selecting the most interesting features is a classical data mining problem. That selection requires a large number of tests, from which a number of false discoveries arise. This paper proposes an original non-parametric control method. A new criterion, UAFWER, defined as the risk of exceeding a pre-set number of false discoveries, is controlled by BS FD, a bootstrap-based algorithm that can be used on one- or two-sided problems. The usefulness of the procedure is illustrated by the selection of differentially interesting association rules on genetic data.
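The criterion described in the abstract, the risk of exceeding a pre-set number of false discoveries, can be illustrated with a small resampling sketch. The code below is not the paper's BS FD algorithm, whose details are not given here. It is a generic bootstrap threshold under stated assumptions: each feature is tested two-sided with a one-sample t statistic against a zero mean, the data are continuous (so no resampled column has zero variance), and the names `abs_t_stats`, `bootstrap_threshold`, `select_features`, `v0` and `delta` are hypothetical. The idea is that more than `v0` null statistics exceed a threshold exactly when the (`v0`+1)-th largest one does, so the threshold is taken as the (1 - `delta`) quantile of that order statistic over bootstrap replicates of centred data.

```python
import math
import random


def abs_t_stats(rows):
    """Absolute one-sample t statistic of each column (H0: column mean is 0)."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        stats.append(abs(mean) / math.sqrt(var / n))
    return stats


def bootstrap_threshold(rows, v0, delta, n_boot=200, seed=0):
    """Threshold c such that, over bootstrap replicates of the centred data,
    at most a fraction delta of replicates have more than v0 statistics above c."""
    rng = random.Random(seed)
    n = len(rows)
    # Centre every column so that the null hypothesis holds in the resampled data.
    col_means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - mu for x, mu in zip(row, col_means)] for row in rows]
    kth_largest = []
    for _ in range(n_boot):
        # Resample whole rows with replacement to keep dependence between features.
        sample = [centred[rng.randrange(n)] for _ in range(n)]
        ordered = sorted(abs_t_stats(sample), reverse=True)
        # More than v0 statistics exceed c exactly when the (v0+1)-th largest does.
        kth_largest.append(ordered[v0])
    kth_largest.sort()
    # Empirical (1 - delta) quantile of the (v0+1)-th largest statistic.
    pos = min(n_boot - 1, max(0, math.ceil((1 - delta) * n_boot) - 1))
    return kth_largest[pos]


def select_features(rows, v0, delta, **kwargs):
    """Flag the features whose observed statistic exceeds the bootstrap threshold."""
    c = bootstrap_threshold(rows, v0, delta, **kwargs)
    return [t > c for t in abs_t_stats(rows)], c
```

A call such as `select_features(rows, v0=2, delta=0.05)` takes `rows` as a list of observations, each a list of feature values, and returns one boolean flag per feature together with the threshold used. Setting `v0=0` reduces the sketch to a classical family-wise error rate threshold based on the maximum statistic.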
Related resources
The Effect of False Correction Strategy and Inference Strategy on the Paramedical Students’ Reading Comprehension and Attitude
A substantial body of studies supports the positive effect of strategy instruction on reading comprehension. This study examined the effect of two reading strategies (i.e., false correction and inference strategy) on English reading comprehension of Iranian paramedical students, using a pretest, posttest, control group design. It also surveyed their attitudes toward the effect and usefulness of ...
A Tutorial on Statistically Sound Pattern Discovery
Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries — patterns that are found in the sample data but do not hold in the wider popul...
Controlling false discoveries in genome scans for selection.
Population differentiation (PD) and ecological association (EA) tests have recently emerged as prominent statistical methods to investigate signatures of local adaptation using population genomic data. Based on statistical models, these genomewide testing procedures have attracted considerable attention as tools to identify loci potentially targeted by natural selection. An important issue with...
Association Rule Interestingness: Measure and Statistical Validation
The search for interesting Boolean association rules is an important topic in knowledge discovery in databases. The set of admissible rules for the selected support and confidence thresholds can easily be extracted by algorithms based on support and confidence, such as Apriori. However, they may produce a large number of rules, many of which are uninteresting. One has to resolve a two-tier problem...
Statistical Detection of EEG Synchrony Using Empirical Bayesian Inference
There is growing interest in understanding how the brain utilizes synchronized oscillatory activity to integrate information across functionally connected regions. Computing phase-locking values (PLV) between EEG signals is a popular method for quantifying such synchronizations and elucidating their role in cognitive tasks. However, high-dimensionality in PLV data incurs a serious multiple test...
Publication date: 2006